
Documentation re-structure #3300

Open
githubnemo wants to merge 24 commits into huggingface:main from githubnemo:feature/doc-restructuring

Conversation

@githubnemo

@githubnemo githubnemo commented Jun 3, 2026

Collaborator

The current state of the PEFT docs is not one of structure and I was constantly annoyed that whenever I wanted to change something there were several places that needed touching and they all felt disconnected. So this is my attempt at structuring the docs. Some of these ideas are quite old (discussed in 01/2025) but are still valid.

I've removed most of the code guides without replacement. That's not ideal: I think we should have code examples, but I think they should be method-focused. Maybe one general example of a training workflow is sufficient because most methods follow the same scheme. I'd appreciate some feedback on this.

All details from the method guides (prompting, lora, oft/boft, etc.) are now integrated into the respective method pages instead. I would have hesitated to do this if these guides had integrated information about the adapters but they didn't. I think it makes a lot more sense to have one place for each method to gather examples/tips/recommendations and that is now the package_reference/<method> page. This page now also hosts a small space that shows the MetaMathQA (and potentially other) benchmark results highlighted for that method.

I've moved the LoRA initializations to package_reference/lora#Initialization and converted the init methods to <hfoption>-tags. This collapses them to a list but may reduce searchability through the document - at least Firefox is not able to search 'through' the option tabs. This also means they don't appear in the ToC, and people specifically searching for, say, PiSSA won't find it directly. I think that's OK though, since the search is able to locate it.

The quicktour is a bit more detailed about what happens under the hood (quick doesn't have to mean simplistic) and includes some new visualizations. I hope that we can integrate more visualizations in the future where it makes sense.

@BenjaminBossan

Member

Thanks a lot for revamping the PEFT docs, which I agree are not very user friendly at the moment. Could you please resolve the two merge conflicts so that preview docs could be rendered? I think it makes more sense to review the docs as a whole than going through the diff (which is probably showing a lot of text that has just moved places).

One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:

  1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?
  2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

nemo added 2 commits June 4, 2026 13:10
The space was not that useful anymore since most methods are compatible
with most models.

The front page buttons are, at least temporarily, removed with the exception
of the quicktour and method overview buttons. I like the visuals
but there should only be elements that are useful.
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@githubnemo

Collaborator Author

> One concern that I have is that links to the PEFT docs could break with the new structure. Thus I have two questions:
>
> 1. Did you update doc links we may have in PEFT to ensure that they'll be up to date?

I didn't at the time but now I have. There were 14 occurrences of now broken links, all of which are fixed now.

> 2. How do we deal with external links? It could be e.g. other repos (say, Axolotl, Hermes skill, etc.) but could also concern HF repos (e.g. links from PEFT or Transformers issues).

I've added a _redirects.yml with the most common redirects I found (mostly from transformers). I also checked axolotl, diffusers and unsloth - the latter was not easy to analyze systematically as I couldn't find the docs as plain text, so I resorted to delegating to an agent which didn't find references to the PEFT docs.

The Hermes PEFT skill (https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/mlops/peft) doesn't seem to link to changed pages in the docs.
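To make the link fix-up concrete, here is a minimal sketch of rewriting links to moved pages from an old-to-new lookup. It assumes a flat mapping of page paths; the two entries in `REDIRECTS` are hypothetical illustrations and not the actual contents of `_redirects.yml`, whose exact format is not shown in this thread. The helper name `rewrite_links` is likewise made up for this example.

```python
import re

# Hypothetical old -> new page mapping; NOT the real contents of _redirects.yml.
REDIRECTS = {
    "developer_guides/lora": "package_reference/lora",
    "task_guides/prompt_based_methods": "package_reference/prompt_tuning",
}

# Matches PEFT doc links, optionally with "main/" and "en/" path prefixes.
DOC_LINK = re.compile(r"https://huggingface\.co/docs/peft/(?:main/)?(?:en/)?([\w/\-]+)")


def rewrite_links(text, redirects):
    """Rewrite links to moved pages; return the new text and the number of rewrites."""
    count = 0

    def repl(match):
        nonlocal count
        page = match.group(1)
        if page not in redirects:
            return match.group(0)  # page did not move, keep the link as-is
        count += 1
        prefix = match.group(0)[: match.start(1) - match.start(0)]
        return prefix + redirects[page]

    return DOC_LINK.sub(repl, text), count


sample = "See https://huggingface.co/docs/peft/main/en/developer_guides/lora#initialization"
print(rewrite_links(sample, REDIRECTS))
```

Anchors such as `#initialization` are left untouched by this sketch, so a renamed section heading would still need a manual check.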

@githubnemo

Collaborator Author

@BenjaminBossan

Member

githubnemo added a commit that referenced this pull request Jun 8, 2026
PR #3300 drafts the idea of embedding the method comparison results into the
respective method pages. This calls for a lighter version of the existing space
to limit the needed space. This is what `app_embed.py` is.

Most of the common processing has moved to the existing and aptly named
`processing.py`.

I think that this is better than having a layout switch in `app.py` as
these apps are meant to be as flat as can be to be readable and maintainable.
@githubnemo githubnemo marked this pull request as ready for review June 8, 2026 12:54
@githubnemo

Collaborator Author

I think this is now ready for review. Sorry about the huge PR but dissolving the guides into the individual method pages made a relatively big splash in terms of changes, even though the individual changes are quite small.

@stevhliu it would be super cool if you could take a look as well :)

When reviewing the rendered doc on moon-ci-docs I noticed that the new images are rendered with borders (esp. visible in the quicktour) and the ToC indentation for LoRA variants is broken but I have no clue how to fix this. @stevhliu do you have an idea?

@BenjaminBossan BenjaminBossan left a comment

Member

Thanks a LOT for working on overhauling the PEFT docs. They always felt lacking and suboptimally structured to me, so I'm very happy to see improvements there.

For this review, I focused on the general sections but haven't reviewed the entries for the individual PEFT methods. This was in order to break down the review in smaller parts, as I'm not going to finish it today. It may also help avoid duplicate effort between me and Steven.

As a more general comment, I saw that some added parts contain manual line breaks, e.g. in overview.md. I would suggest removing those completely.

I like the idea of including a benchmark overview for each PEFT method. Now that we have image generation too, it would be great to add an option to toggle the benchmark, but let's leave that to a future PR. I noticed, however, that not each PEFT method includes the benchmark, e.g. HRA is missing it. Also, some methods like HiRA have the graph but no corresponding data points, but maybe its result was added after the space was deployed?

I also wonder if we should not fully remove the legend, as the resulting graph can become quite cramped:

Image

There is also a bit of an inconsistency about the legend, e.g. for Lily it only labels the line but not the points. I think it should be removed for simplicity.

Comment thread docs/source/index.md
<div class="flex flex-col basis-1/4">
There are numerous methods to "adapt" existing models, often extensively integrating into the model. PEFT can be thought of as a framework for arbitrary methods of model adaption (modifying weights, wrapping layers, manipulating KV-caches, ...) while also serving as a reference implementation for many fine-tuning methods.
</div>
<div class="flex flex-col basis-3/4 pl-10 pr-10"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/adapter_installation.png" width="100%"></div>

Member

This looks a bit odd (middle part) with dark theme:

Image

Comment thread docs/source/quicktour.md

## Multiple adapters

PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.

Member

Suggested change
- PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters by calling `peft_model.add_adapter(adapter_name=...)`.
+ PEFT supports installing multiple adapters (of the same kind, in this document this would be LoRA) on top of a base model. When you call `get_peft_model` there is only one adapter named `"default"` but you can add as many additional adapters as you want by calling `peft_model.add_adapter(adapter_name=...)`.

Comment thread docs/source/quicktour.md
model = AutoPeftModel.from_pretrained("smangrul/openai-whisper-large-v2-LORA-colab")
```

## Multiple adapters

Member

In the section above, the docs describe the AutoPeftModel API for loading trained adapters. I'm just wondering if we should not at the very least mention the PeftModel.from_pretrained(base_model, adapter_id) API as well.


## Choosing the right method

Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.

Member

I think as is, the last sentence doesn't quite make sense, even though it's clear what is meant. Here is a suggestion for a different wording.

Suggested change
- Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.
+ Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length; some methods are more prone to memory spikes than others.


Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).

## Chunked NLL loss

Member

I'd put this section last, I think the other ones below are more generally applicable.


## Quantization

Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).

Member

Suggested change
- Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incured by quantization methods. Read the [PEFT quantization guide](quantization).
+ Quantization is one of the best ways to reduce memory consumption *of the base model* and will, depending on the employed quantization, also reduce activation memory. Since the PEFT methods will only take up a small portion of the total number of parameters, PEFT defaults to use a higher precision than the base model. This can also have the effect that adapters can mitigate some of the quality loss incurred by quantization methods. Read the [PEFT quantization guide](quantization).


## Gradient Checkpointing

You can trade memory with computation by only saving every nth gradient between layers and computing the rest on the fly. Check out the [gradient checkpointing](https://huggingface.co/docs/transformers/grad_checkpointing) documentation of Transformers to learn more.

Member

Maybe worth mentioning that if not using Transformers or Diffusers, users may have to implement their own GC logic.
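As a rough idea of what "own GC logic" could look like for a plain PyTorch model, here is a minimal sketch using `torch.utils.checkpoint`. It assumes `torch` is installed; `CheckpointedMLP` is a hypothetical toy model. Checkpointing here means that the activations inside each block are not stored during the forward pass and are recomputed during backward.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Toy model (hypothetical) that optionally checkpoints each block."""

    def __init__(self, dim=16, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Skip storing this block's intermediate activations;
                # they are recomputed when gradients are needed.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = CheckpointedMLP().train()
inputs = torch.randn(3, 16, requires_grad=True)
model(inputs).sum().backward()
print(inputs.grad.shape)
```

The outputs and gradients should match the non-checkpointed run; only peak memory and compute time change.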

Giving general advice for training large models is hard but for generative
models, especially language models, you can follow these steps:

1. use prompting (few-shot examples in the prompt) to see if the model is

Member

Suggested change
- 1. use prompting (few-shot examples in the prompt) to see if the model is
+ 1. use prompting (e.g. few-shot examples in the prompt) to see if the model is

fine-tuning step is potentially unlearning past knowledege.

The [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) aims to give a rough overview of (most) implemented methods on selected benchmarks and models.

Member

It could also be useful to mention some criteria here that may guide you in choosing the appropriate PEFT method:

  • quantization: not all methods support quantized base models
  • feature set: not all features are supported for all methods (e.g. multiple adapters, mixed adapter inference)
  • layer types: linear layers are generally always supported, but not all methods support embedding (important for expanding vocab) or conv (important for some image models)
  • inference runtime: PEFT methods generally add runtime overhead but some of that can be mitigated (e.g. some methods allow merging, removing the overhead)
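The merging point in the last criterion can be illustrated with a toy LoRA-style update. This is a sketch in plain Python with made-up 2x2 values, assuming the common `lora_alpha / r` scaling; it is not PEFT's implementation, and other methods may merge differently or not support merging at all.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy shapes: W is d_out x d_in, B is d_out x r, A is r x d_in (illustrative values).
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]
A = [[0.5, -1.0]]
lora_alpha, r = 2, 1
scale = lora_alpha / r


def forward_unmerged(x):
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # extra work on every forward pass
    return [b + scale * u for b, u in zip(base, update)]


# Merge once: W' = W + scale * (B @ A). Afterwards a single matmul suffices.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


def forward_merged(x):
    return matvec(W_merged, x)


x = [1.0, 2.0]
print(forward_unmerged(x), forward_merged(x))
```

Both paths compute the same output, which is why merging removes the adapter's inference overhead for methods whose update can be folded into the base weight.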


## Layer Tuning

Layer Tuning categorizes methods that target specific layers of a model such as [LayerNorm Tuning](../package_reference/layernorm_tuning)

Member

"target specific layers" doesn't make it quite clear that it means that existing parameters of the base model are made trainable, since you could say that LoRA also targets specific layers. I would state that explicitly.

@stevhliu stevhliu left a comment

Member

nice work! i focused mainly on memory_efficient-training and memory/overview in this pass

> the ToC indentation for LoRA variants is broken

i think the doc-builder only supports 3 levels of nesting so maybe flatten the variants section?

> new images are rendered with borders

the doc-builder automatically renders it with a border i believe. i would open an issue on the doc-builder repo for this :)


Low-Rank Adaptation ([LoRA](https://huggingface.co/papers/2106.09685)) is a PEFT method that decomposes a large matrix into two smaller low-rank matrices. This drastically reduces the number of parameters that need to be fine-tuned.

The abstract from the paper is:

Member

i think this is the wrong abstract

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, LoRA is typically only applied to the attention blocks in Transformer models - it may be worth targeting other layers as well. The resulting number of trainable parameters in a LoRA model depends on the size of the update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
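To put numbers on the dependence on `r` and the weight shape, here is a small arithmetic sketch. It assumes the usual LoRA factorization into `A` (`r x d_in`) and `B` (`d_out x r`); the 4096 x 4096 layer size is an illustrative choice, not a value from the docs.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two update matrices A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def full_params(d_in, d_out):
    """Parameters of the original weight matrix (bias ignored)."""
    return d_in * d_out


d_in = d_out = 4096  # illustrative layer size
for r in (8, 16, 64):
    lora = lora_trainable_params(d_in, d_out, r)
    share = lora / full_params(d_in, d_out)
    print(f"r={r}: {lora} trainable parameters ({share:.2%} of the full matrix)")
```

The count grows linearly in `r` and in `d_in + d_out`, while the full matrix grows with the product `d_in * d_out`.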

## Utility
You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#Initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.

Member

Suggested change
- You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#Initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.
+ You can initialize the low-rank matrices with different use-cases in mind - task awareness (CoRDA, EVA), faster convergence (PiSSA), mitigating quantizations (LoftQ) - just to name a few use-cases. Read about the different initializations [below](#initialization). The default initialization is for LoRA to be a no-op, to gradually learn new behavior without interfering much with the existing model.

Giving general advice for training large models is hard but for generative
models, especially language models, you can follow these steps:

1. use prompting (few-shot examples in the prompt) to see if the model is

Member

may be nice to reorder the categories to be more consistent

  • the intro workflow lists prompting, layer tuning, adapters
  • the page sections are ordered adapters, prompting, layer tuning
  • the toctree orders them as layer tuning, soft prompting, adapters. it would also be good to pick and use the same terms in the sidebar and here (prompt-based methods vs soft prompting)


## Layer Tuning

Layer Tuning categorizes methods that target specific layers of a model such as [LayerNorm Tuning](../package_reference/layernorm_tuning)

Member

this section feels a bit thin. you could maybe add something about what distinguishes it more from prompting that makes it more expressive. otherwise, it'd be harder to pick between the two of them

and [adapter methods](#adapter-methods). These methods are generally
more expressive than prompt-based methods and get closer to full-finetuning.
3. Make sure to measure retention of already learnt knowledge since each
fine-tuning step is potentially unlearning past knowledege.

Member

Suggested change
- fine-tuning step is potentially unlearning past knowledege.
+ fine-tuning step is potentially unlearning past knowledge.


# Parameter efficient fine-tuning methods

Training a model parameter efficiently means to train as few parameters as possible to achieve comparable performance to training all parameters, i.e. full fine-tuning. There is, of course, no free lunch: by using fewer and therefore less expressive, parameters, it is not guaranteed that you will get the same performance! You may need to use a specific PEFT method to get optimal results for the model/task combination you want to train. But you will need less memory and possibly less compute during training and may gain features such as fast hot-swapping between trained expert models and less forgetting of previous knowledge compared to full fine-tuning.

Member

would be good to add a link to hot-swapping to make it concrete

Suggested change
- Training a model parameter efficiently means to train as few parameters as possible to achieve comparable performance to training all parameters, i.e. full fine-tuning. There is, of course, no free lunch: by using fewer and therefore less expressive, parameters, it is not guaranteed that you will get the same performance! You may need to use a specific PEFT method to get optimal results for the model/task combination you want to train. But you will need less memory and possibly less compute during training and may gain features such as fast hot-swapping between trained expert models and less forgetting of previous knowledge compared to full fine-tuning.
+ PEFT methods train as few parameters as possible while aiming for performance comparable to full fine-tuning. Fewer trainable parameters are less expressive, so the same performance isn't guaranteed. In exchange you use less memory, often less compute, and gain features like fast hot-swapping between expert adapters and less forgetting of prior knowledge.


Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.

When using [TRL] you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.

Member

Suggested change
- When using [TRL] you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.
+ When using [TRL](https://huggingface.co/docs/trl) you can either use the [Liger kernel integration](https://huggingface.co/docs/trl/liger_kernel_integration) or use [Chunked NLLLoss](https://huggingface.co/docs/trl/v1.5.1/en/reducing_memory_usage#chunked-cross-entropy-for-reducing-peak-memory-usage). The latter will split the sequence in chunks of size 256 to keep the maximum memory consumption constant.
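The chunking idea can be sketched in plain Python. This is an illustration of the principle only and not TRL's implementation: it shows that processing the sequence in fixed-size chunks yields the same mean loss while only `chunk_size` rows of log-probabilities exist at a time. In a real training setup the logits themselves would also be computed per chunk; here they are given as input for simplicity, and all function names are hypothetical.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def nll_full(logits_per_token, targets):
    """Mean NLL with all sequence x vocabulary log-probabilities held at once."""
    log_probs = [log_softmax(row) for row in logits_per_token]
    return -sum(lp[t] for lp, t in zip(log_probs, targets)) / len(targets)


def nll_chunked(logits_per_token, targets, chunk_size=256):
    """Same mean NLL, but log-probabilities are materialized chunk by chunk."""
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        rows = logits_per_token[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        chunk_log_probs = [log_softmax(row) for row in rows]  # at most chunk_size rows
        total -= sum(lp[t] for lp, t in zip(chunk_log_probs, chunk_targets))
    return total / len(targets)


random.seed(0)
vocab, seq_len = 11, 1000
logits = [[random.uniform(-3, 3) for _ in range(vocab)] for _ in range(seq_len)]
targets = [random.randrange(vocab) for _ in range(seq_len)]
print(nll_full(logits, targets), nll_chunked(logits, targets))
```

Because the per-chunk working set is bounded by `chunk_size`, the peak size of the intermediate no longer grows with the sequence length.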


# Memory Efficient Training

🤗 PEFT provides you with methods for parameter efficient fine-tuning but that doesn't mean that your training process is memory efficient. This guide is a collection of tips that you can use to improve memory efficiency of your training process. This guide is mostly an overview page that will link you to the respective other guides and offer some tips for specific situations.

Member

Suggested change
- 🤗 PEFT provides you with methods for parameter efficient fine-tuning but that doesn't mean that your training process is memory efficient. This guide is a collection of tips that you can use to improve memory efficiency of your training process. This guide is mostly an overview page that will link you to the respective other guides and offer some tips for specific situations.
+ 🤗 PEFT makes fine-tuning parameter efficient, but not automatically memory efficient. This overview collects tips for cutting training memory and links to the detailed guides.


## Chunked NLL loss

Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.

Member

Suggested change
- Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks, for that matter) but it is usually computed in one go, meaning you will allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.
+ Using [`NLLLoss`](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) is very common when training language models (or classification tasks). You allocate a matrix of size `batch × sequence × vocabulary`. With particularly long sequences or vocabularies this can get expensive fast.
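A quick back-of-the-envelope sketch of how large the `batch × sequence × vocabulary` matrix gets. The batch size, sequence length, vocabulary size and 4 bytes per element (fp32) are illustrative assumptions, not figures from the docs.

```python
def logits_bytes(batch, sequence, vocabulary, bytes_per_element=4):
    """Size in bytes of a dense batch x sequence x vocabulary matrix."""
    return batch * sequence * vocabulary * bytes_per_element


# Illustrative values: 8 sequences of 4096 tokens, 128k vocabulary, fp32.
size = logits_bytes(batch=8, sequence=4096, vocabulary=128_000)
print(f"{size / 2**30:.1f} GiB")
```

Doubling either the sequence length or the vocabulary doubles this figure, which is why the one-shot loss computation becomes a memory bottleneck for long contexts.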


Not every PEFT method is built equally and some formulations are easier to build in a memory efficient manner. If you are on a memory budget it makes sense to check out the [PEFT method comparison suite](https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison) and filter for **maximum** accelerator memory usage. Average accelerator memory usage can be fairly equal across methods but not every method scales equally with activations and sequence length and is more prone to memory spikes than others.

Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).

Member

Suggested change
- Especially when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens it might make sense to look into [using trainable tokens](troubleshooting#using-trainable-tokens).
+ Consider [using trainable tokens](troubleshooting#using-trainable-tokens) when targeting large layers like language modeling heads or embedding layers to fine-tune specific tokens.
